Abstract 

Interactive multiview video applications endow users with the freedom to navigate through neighboring viewpoints in a 3D 
scene. To enable such interactive navigation with a minimum view-switching delay, multiple camera views are sent to the users, 
which are used as reference images to synthesize additional virtual views via depth-image-based rendering. In practice, bandwidth 
constraints may however restrict the number of reference views sent to clients per time unit, which may in turn limit the quality 
of the synthesized viewpoints. We argue that the reference view selection should ideally be performed close to the users, and 
we study the problem of in-network reference view synthesis such that the navigation quality is maximized at the clients. We 
consider a distributed cloud network architecture where data stored in a main cloud is delivered to end users with the help of 
cloudlets, i.e., resource-rich proxies close to the users. In order to satisfy last-hop bandwidth constraints from the cloudlet to 
the users, a cloudlet re-samples viewpoints of the 3D scene into a discrete set of views (combination of received camera views 
and virtual views synthesized) to be used as reference for the synthesis of additional virtual views at the client. This in-network 
synthesis leads to better viewpoint sampling given a bandwidth constraint compared to simple selection of camera views, but 
it may carry a distortion penalty in the cloudlet-synthesized reference views. We therefore cast a new reference view 
selection problem where the best subset of views is defined as the one minimizing the distortion over a view navigation window 
defined by the user under some transmission bandwidth constraints. We show that the view selection problem is NP-hard, and 
propose an effective polynomial time algorithm using dynamic programming to solve the optimization problem under general 
assumptions that cover most of the multiview scenarios in practice. Simulation results confirm the performance gain offered 
by virtual view synthesis in the network. They show that cloud computing resources provide important benefits in resource-greedy 
applications such as interactive multiview video. 


Index Terms 

Depth-image-based rendering, network processing, cloud-assisted applications, interactive systems. 


I. Introduction 

Interactive free viewpoint video systems [1] endow users with the ability to choose and display any virtual view of a 3D 
scene, given original viewpoint images captured by multiple cameras. In particular, a virtual view image can be synthesized 
by the decoder via depth-image-based rendering (DIBR) O using texture and depth images of two neighboring views that act 
as reference viewpoints. One of the key challenges in interactive multiview video streaming (IMVS) O systems is to transmit 
an appropriate subset of reference views from a potentially large number of camera-captured views such that the client enjoys 
high-quality and low-delay view navigation even in resource-constrained environments. 

In this paper, we propose a new paradigm to solve the reference view selection problem and capitalize on cloud computing 
resources to perform fine adaptation close to the clients. We consider a hierarchical cloud framework, where the selection of 
reference views is performed by a network of cloudlets, i.e., resource-rich proxies that can perform personalized processing at 
the edges of the core network ||71, O. An adaptation at the cloudlets results in a smaller round-trip time (RTT), hence more 
reactivity than in more centralized architectures. Specifically, we consider the scenario depicted in Fig. 1, where a main cloud 
stores pre-encoded video from different cameras, which are then transmitted to the edge cloudlets that act as proxies for final 
delivery to users. We assume that there is sufficient network capacity between the main cloud and the edge cloudlets for the 
transmission of all camera views, but there is a bottleneck of limited capacity between a cloudlet and a nearby 
user.¹ In this scenario, each cloudlet sends to a client the set of reference views that respect bandwidth capacities and enable 
synthesis of all viewpoints in the client’s navigation window. This window is defined as the range of viewpoints in which the 
user can navigate during the RTT and enables zero-delay view-switching at the client. 

We argue that, in resource-constrained networks, resampling the viewpoints of the 3D scene in the network (i.e., 
synthesizing novel virtual views in the cloudlets that are transmitted as new references to the decoder) is beneficial compared 
to the mere subsampling of the original set of camera views. We illustrate this in Fig. 1, where the main cloud stores three 
coded camera views $\{v_1, v_2, v_3\}$, while the bottleneck links between cloudlet-user pairs can support the transmission of only 
two views. H If user I requests a navigation window [u 2 a^'^ 2 .s]^ the cloudlet can simply forward the closest camera views 

L. Toni and P. Frossard are with École Polytechnique Fédérale de Lausanne (EPFL), Signal Processing Laboratory - LTS4, CH-1015 Lausanne, Switzerland. 
Email: {laura.toni, pascal.frossard}@epfl.ch. 

Gene Cheung is with the National Institute of Informatics, Tokyo, Japan. Email: cheung@nii.ac.jp 

¹In practice, the last-mile access network is often the bottleneck in real-time media distribution. 

²We consider integer index $i$ for any camera view, while we assume that a virtual view can have a non-integer index $i.x$, which corresponds to a position 
between camera views $v_i$ and $v_{i+1}$. 
Fig. 1. Considered scenario. Green lines represent channels without bandwidth constraints; red lines are bottleneck channels. 


$v_2$ and $v_3$. However, if user 2 requests the navigation window $[u_{1.8}, u_{2.2}]$, transmitting camera views $v_1$ and $v_3$ results in 
large synthesized view distortions due to the large distance between reference and virtual views (called reference view distance 
in the sequel). Instead, the cloudlet can synthesize virtual views $u_{1.8}$ and $u_{2.2}$ using camera views $v_1, v_2, v_3$ and send these 
virtual views to user 2 as new reference views for the navigation window $[u_{1.8}, u_{2.2}]$. This strategy may result in smaller 
synthesized view distortion due to the smaller distance to the reference views. However, the in-network virtual view synthesis 
may also introduce distortion into the new reference views $u_{1.8}$ and $u_{2.2}$, which results in a tradeoff that should be carefully 
considered when choosing the views to be synthesized in the cloudlet. 

Equipped with the above intuitions, we study the main tradeoff between reference distortion and bandwidth gain. Using a 
Gauss-Markov model, we first analyze the benefit of synthesizing new reference images in the network. We then formulate a 
new synthesized reference view selection optimization problem. It consists in selecting or constructing the optimal reference 
views that lead to the minimum distortion for all synthesized virtual views in the user’s navigation window subject to a 
bandwidth constraint between the cloudlet and the user. We show that this combinatorial problem can be solved optimally but 
that it is NP-hard. We then introduce a generic assumption on the view synthesis distortion which leads to a polynomial time 
solution with a dynamic programming (DP) algorithm. We then provide extensive simulation results for synthetic and natural 
sequences. They confirm the quality gain experienced by the IMVS clients when synthesis is allowed in the network, with 
respect to scenarios whose edge cloudlets can only transmit camera views. They also show that synthesis in the network makes it 
possible to maintain good navigation quality when reducing the number of cameras as well as when cameras are not ideally positioned 
in the 3D scene. This is an important advantage in practical settings, which confirms that cloud processing resources can be 
judiciously used to improve the performance of applications that are a priori quite greedy in terms of network resources. 

The remainder of this paper is organized as follows. Related works are described in Section II. In Section III, we provide 
a system overview and analyze the benefit of in-network view synthesis via a Gauss-Markov model to impart intuitions. The 
reference view selection optimization problem is then formulated in Section IV. We propose general assumptions on view 
synthesis distortion in Section V and derive an additional polynomial time view selection algorithm. In Section VI, we discuss 
the simulation results, and we conclude in Section VII. 

II. Related Work 

Prior studies addressed the problem of providing interactivity in selecting views in IMVS, while saving on transmitted 
bandwidth and view-switching delay O, ll9l- l[T5ll . These works are mainly focused on optimizing the frame coding structure 
to improve interactive media services. In the case of pre-stored camera views, however, rather than optimal frame coding 
structures, interactivity in network-constrained scenario can be addressed by studying optimal camera selection strategies, 
where a subset of selected camera views is actually transmitted to clients such that the navigation quality is maximized and 
resource constraints are satisfied Ei-ia, im-cii. In im, an optimal camera view selection algorithm in resource-constrained 
networks has been proposed based on the users’ navigation paths. In [20], a bit allocation algorithm over an optimal subset of 
camera views is proposed for optimizing the visual distortion of reconstructed views in interactive systems. Finally, in 1^ . 
12211 authors optimally organize camera views into layered subsets that are coded and delivered to clients in a prioritized fashion 
to accommodate network and client heterogeneity and to effectively exploit the resources of the overlay network. 
While in these works the selection is limited to camera views, in our work we rather assume in-network processing able to 
synthesize virtual viewpoints in the cloud network. 

In-network adaptation strategies make it possible to cope with network resource constraints and are mainly categorized into i) packet-level 
processing and ii) modification of the source information. In the first category, packet filtering, routing strategies 1^ . l24ll or 
caching of media content information [25], [26] save network resources while improving the quality experienced by 
clients. To better address media delivery services in highly heterogeneous scenarios, network coding strategies for multimedia 
streaming have been also proposed Eli - 1^ . In the second category — in-network processing at the source level — the main 
objective is usually to avoid transmitting large amounts of raw streams to the clients by processing the source data in the 
network to reduce both the communication volume and the processing required at the client side. Transcoding strategies might 
be collaboratively performed in peer-to-peer networks or in the cloud lISTl . Furthermore, source data can be compressed 
in the cloud 1301 . l32l . |3^ to efficiently address users’ requests. Rather than media processing in the main cloud, offloading 
resources to a cloudlet, i.e., a resource-rich machine in the proximity of the users, might reduce the transmission latency Q, 
la. This is beneficial for delay-sensitive / interactive applications 1^ - 1^ . Because of the proximity of cloudlets to users, 
cloudlet computing has been under intense investigation for cloud-gaming applications, as shown in [37] and references 
therein. The above works are mainly focused on multimedia processing, rather than on specific multiview scenarios. However, the 
use of cloudlets in delay sensitive applications motivates the idea of cloudlet-based view synthesis for IMVS. 

Cloud processing for multiview system is considered in |[38l - ll4Ql . In 1^ authors mainly address the cloud-based processing 
from a security perspective. In [40], view synthesis in the network has been introduced for cloud networks to offload clients’ 
terminals (in terms of complexity). The desired view is synthesized in the cloud and then sent directly to clients. However, 
only the view requested by the client is synthesized. This means that either the desired view is known a priori at the source or 
a switching delay is experienced by the clients. To the best of our knowledge, none of the works investigating cloud processing 
has considered the problem of multiview interactive streaming under network resource constraints. In our work, we propose 
view synthesis in the network mainly to both overcome uncertainty of users’ requests in interactive systems and to cope with 
limited network resources. 


III. Background 

A. System Model 

Let $\mathcal{V} = \{v_1, \ldots, v_N\}$ be the set of the $N$ camera viewpoints captured by the multiview system. For all camera-captured 
views, compressed texture and depth maps are stored at the main cloud, with each texture/depth map pair encoded at the same 
rate using standard video coding tools like H.264 [41] or HEVC [42]. The possible viewpoints offered to the users are denoted 
by $\mathcal{U} = \{v_1, v_1 + \delta, \ldots, v_N\}$. The set $\mathcal{U}$ contains both synthesized views and camera views for navigation between the leftmost 
and rightmost camera views, $v_1$ and $v_N$. It is equivalent to offering views $u = k\delta$, where $k$ is a positive integer and $\delta$ is a 
pre-determined fraction that describes the minimum view spacing between neighboring virtual views. We consider that any 
virtual viewpoint $u$ can be synthesized using a pair of left and right reference view images $v_L$ and $v_R$, $v_L < u < v_R$, via 
a known DIBR technique such as 3D warping [43]. 

Each user is served by an assigned cloudlet through a bottleneck link of capacity $C$, expressed in number of views. Assuming 
an RTT of $T$ seconds between the cloudlet and the user, and a maximum speed $\rho$ at which a user can navigate to neighboring 
virtual views, one can compute a navigation window $W(u) = [u - \rho T, u + \rho T]$, given that the user has selected virtual view 
$u$ at some time $t_0$. The goal of the cloudlet is to serve the user with the best subset of $C$ viewpoints in $\mathcal{U}$ that synthesize the 
best quality virtual views in W{u). In this way, the user can experience zero-delay view navigation at time to + T (see iffflil 
for details) with optimized visual quality. 
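
As a minimal sketch of the navigation window defined above, the following Python snippet enumerates the offered viewpoints (multiples of the view spacing) that fall inside the window around the current view. The function name, the parameter names, and the clipping of the window to the camera range are illustrative assumptions and are not part of the system model itself.

```python
from fractions import Fraction


def navigation_window(u, rho, rtt, delta, v_min, v_max):
    """List the offered viewpoints k*delta inside [u - rho*rtt, u + rho*rtt].

    Assumption (not from the paper): the window is clipped to the
    camera range [v_min, v_max].
    """
    lo = max(u - rho * rtt, v_min)
    hi = min(u + rho * rtt, v_max)
    k_lo = -(-lo // delta)  # ceil(lo / delta)
    k_hi = hi // delta      # floor(hi / delta)
    return [k * delta for k in range(int(k_lo), int(k_hi) + 1)]


# Hypothetical numbers: current view 2, speed of 1 view/s, RTT of 0.2 s,
# view spacing 0.1, cameras at views 1 to 3.
views = navigation_window(2, Fraction(1), Fraction(1, 5), Fraction(1, 10), 1, 3)
print([float(v) for v in views])
```

Exact rational arithmetic (`Fraction`) is used here only to avoid floating-point rounding at the window boundaries.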


B. Analysis of Cloudlet-based Synthesized Reference View 

To impart intuition of why synthesizing new references at in-network cloudlets may improve rendered view quality at an 
end user, we consider a simple model among neighboring views. Similarly to [44], [45], we assume a Gauss-Markov model, 
where variable $x_v$ at view $v$ is correlated with $x_{v-1}$: 

$$x_v = x_{v-1} + e_v, \quad \forall v \geq 2 \tag{1}$$

where $e_v$ is a zero-mean independent Gaussian variable with variance $\sigma_v^2$, and $x_1 = e_1$. A large $\sigma_v^2$ would mean that views $x_v$ and 
$x_{v-1}$ are not similar. We can write the $N$ variables $x_1, \ldots, x_N$ in matrix form: 

$$\mathbf{F}\mathbf{x} = \mathbf{e}, \quad \mathbf{x} = \mathbf{F}^{-1}\mathbf{e} \tag{2}$$

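where 

$$\mathbf{F} = \begin{bmatrix} 1 & 0 & \cdots & & 0 \\ -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}, \quad \mathbf{e} = \begin{bmatrix} e_1 \\ \vdots \\ e_N \end{bmatrix} \tag{3}$$

³Note that view synthesis can be performed in-network (to generate new reference views) or at the user side (to render desired views for observation). In 
both cases, the same rendering method and distortion model apply. 
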
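Given that $\mathbf{x}$ is zero-mean, the covariance matrix $\mathbf{C}$ can be computed as: 

$$\mathbf{C} = E[\mathbf{x}\mathbf{x}^T] = \mathbf{F}^{-1} E[\mathbf{e}\mathbf{e}^T] (\mathbf{F}^{-1})^T \tag{4}$$

where $E[\mathbf{e}\mathbf{e}^T] = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)$ is a diagonal matrix. The precision matrix $\mathbf{Q}$ is the inverse of $\mathbf{C}$ and can be derived as 
follows: 

$$\begin{aligned} \mathbf{Q} = \mathbf{C}^{-1} &= \left(\mathbf{F}^{-1}\, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)\, (\mathbf{F}^{-1})^T\right)^{-1} \\ &= \mathbf{F}^T\, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)^{-1}\, \mathbf{F} \\ &= \begin{bmatrix} \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} & -\frac{1}{\sigma_2^2} & 0 & \cdots \\ -\frac{1}{\sigma_2^2} & \frac{1}{\sigma_2^2} + \frac{1}{\sigma_3^2} & -\frac{1}{\sigma_3^2} & \cdots \\ \vdots & \ddots & \ddots & \ddots \\ 0 & \cdots & -\frac{1}{\sigma_N^2} & \frac{1}{\sigma_N^2} \end{bmatrix} \end{aligned} \tag{5}$$

which is a tridiagonal matrix. 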
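
The tridiagonal structure of the precision matrix can be checked numerically. The sketch below builds the bidiagonal matrix of the Gauss-Markov model and evaluates the product of its transpose, the inverse noise covariance, and itself, entry by entry; the function name and the variance values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def precision_matrix(variances):
    """Precision matrix of the chain x_v = x_{v-1} + e_v.

    variances[i] is the variance of the (i+1)-th noise term.
    """
    n = len(variances)
    # F has ones on the diagonal and -1 on the first sub-diagonal.
    F = [[Fraction(int(i == j) - int(i == j + 1)) for j in range(n)]
         for i in range(n)]
    # Q[j][k] = sum_i F[i][j] * F[i][k] / variances[i]
    return [[sum(F[i][j] * F[i][k] / variances[i] for i in range(n))
             for k in range(n)]
            for j in range(n)]


Q = precision_matrix([1, 2, 3, 4])  # hypothetical variances
for row in Q:
    print([str(entry) for entry in row])
```

With these values, the diagonal entry for the third view is the sum of the inverse third and fourth variances, which is the quantity used in the four-view example of this section.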

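When synthesizing a view $x_n$ using its neighbors $x_{n-1}$ and $x_{n+1}$, we would like to know the resulting precision. Without 
loss of generality, we write $\mathbf{x}$ as a concatenation of two sets of variables, i.e., $\mathbf{x} = [\mathbf{y} \; \mathbf{z}]$. It can be shown [46] that the 
conditional mean and precision matrix of $\mathbf{y}$ given $\mathbf{z}$ are: 

$$\begin{aligned} \boldsymbol{\mu}_{y|z} &= \boldsymbol{\mu}_y - \mathbf{Q}_{yy}^{-1} \mathbf{Q}_{yz} (\mathbf{z} - \boldsymbol{\mu}_z) \\ \mathbf{Q}_{y|z} &= \mathbf{Q}_{yy} \end{aligned} \tag{6}$$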


Consider now a set of four views $x_1, x_2, x_3, x_4$, where $x_1, x_2, x_4$ are camera views transmitted from the main cloud. 
Suppose further that the user window is [1.8, 2.2], and the cloudlet has to choose between using received $x_4$ as right reference, 
or synthesizing new reference $x_3$ using received $x_2$ and $x_4$. Using the discussed Gauss-Markov model (1) and the conditionals 
(6), we see that synthesizing $x_3$ using references $x_2$ and $x_4$ results in precision: 

$$Q_{3|(2,4)} = Q_{33} = \frac{1}{\sigma_3^2} + \frac{1}{\sigma_4^2} \tag{7}$$


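$1/Q_{33}$ is thus the additional noise variance when using new reference $x_3$ to synthesize $x_2$. We can then compute the conditional 
precision $Q_{2|(1,3)}$ given new reference $x_3$: 

$$Q_{2|(1,3)} = \frac{1}{\sigma_2^2} + \frac{1}{\sigma_3^2 + \left(\frac{1}{\sigma_3^2} + \frac{1}{\sigma_4^2}\right)^{-1}} \tag{8}$$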


In comparison, if a user uses received $x_4$ as right reference, $x_4$ accumulates two noise terms relative to $x_2$: 

$$x_4 = x_2 + e_3 + e_4 \tag{9}$$

The resulting conditional precision of $x_2$ given $x_1$ and $x_4$ is: 

$$Q_{2|(1,4)} = \frac{1}{\sigma_2^2} + \frac{1}{\sigma_3^2 + \sigma_4^2} \tag{10}$$


We now compare $Q_{2|(1,3)}$ in (8) with $Q_{2|(1,4)}$ in (10). We see that if $\sigma_3^2$ is very large relative to $\sigma_4^2$, then $1/Q_{33} \approx \sigma_4^2$, 
and $Q_{2|(1,3)} \approx Q_{2|(1,4)}$. That means that if view $x_3$ is very different from $x_2$, then synthesizing new reference $x_3$ does not 
help improve the precision of $x_2$. However, if $\sigma_3^2 < \infty$, then $1/Q_{33} < \sigma_4^2$, and $Q_{2|(1,3)} > Q_{2|(1,4)}$, which means that in 
general it is worth synthesizing new reference $x_3$. The reason can be interpreted from the derivation above: by synthesizing $x_3$ 
using both $x_2$ and $x_4$, the uncertainty (variance) for the right reference has been reduced from $\sigma_4^2$ to $\left(\frac{1}{\sigma_3^2} + \frac{1}{\sigma_4^2}\right)^{-1}$, improving 
the precision of the subsequent view synthesis. 
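
The comparison between the two conditional precisions can be illustrated numerically. The following sketch evaluates both expressions for arbitrary noise variances; the function names and the variance values are hypothetical, and the two formulas are the ones reconstructed from the derivation in this section.

```python
def precision_with_synthesis(var2, var3, var4):
    """Precision of x2 given x1 and a reference x3 synthesized from x2 and x4."""
    q33 = 1.0 / var3 + 1.0 / var4        # precision of the synthesized x3
    return 1.0 / var2 + 1.0 / (var3 + 1.0 / q33)


def precision_direct(var2, var3, var4):
    """Precision of x2 given x1 and the received camera view x4."""
    return 1.0 / var2 + 1.0 / (var3 + var4)


# Comparable variances: synthesizing the new reference helps.
gain = precision_with_synthesis(1.0, 1.0, 1.0) - precision_direct(1.0, 1.0, 1.0)
# Very large third variance: both precisions become almost identical.
gap = precision_with_synthesis(1.0, 1e6, 1.0) - precision_direct(1.0, 1e6, 1.0)
print(gain, gap)
```

The gain is strictly positive for any finite variances, since the harmonic combination of the third and fourth variances is always smaller than the fourth variance alone.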


IV. Reference View Selection Problem 

In this section, we first formalize the synthesized reference view selection problem. We then describe an assumption on the 
distortion of synthesized viewpoints. We conclude by showing that under the considered assumption the optimization problem 
is NP-hard. 
A. Problem Formulation 

Interactive view navigation means that a user can construct any virtual view within a specified navigation window with zero 
view-switching delay, using viewpoint images transmitted from the main cloud as reference. We denote this navigation 
window by $[U_L^0, U_R^0]$, which depends on the user’s current observed viewpoint. If bandwidth is not a concern, for best synthesized 
view quality the edge cloudlet would send to the user all camera-captured views in $\mathcal{V}$ as reference to synthesize virtual view 
$u$, $\forall u \in [U_L^0, U_R^0]$. When this is not feasible due to limited bandwidth $C$ between the serving cloudlet and the user, among all 
subsets $\mathcal{T} \subseteq \mathcal{U}$ of synthesized and camera-captured views that satisfy the bandwidth constraint, the cloudlet must select the 
best subset $\mathcal{T}^*$ that minimizes the aggregate distortion $\mathcal{D}(\mathcal{T})$ of all virtual views $u \in [U_L^0, U_R^0]$, i.e., 

$$\mathcal{T}^* = \arg\min_{\mathcal{T} \subseteq \mathcal{U}} \mathcal{D}(\mathcal{T}) \quad \text{s.t.} \quad |\mathcal{T}| \leq C \tag{11}$$

We note that (11) differs from existing reference view selection formulations [11], [22] in that the cloudlet has the extra 
degree of freedom to synthesize novel virtual view(s) as new reference(s) for transmission to the user. 

Denote by $D(v)$ the distortion of viewpoint image $v$, due to lossy compression for a camera-captured view, or to DIBR 
synthesis for a virtual view. The distortion $\mathcal{D}(\mathcal{T})$ experienced over the navigation window at the user is then given by 

$$\mathcal{D}(\mathcal{T}) = \sum_{u \in [U_L^0, U_R^0]} \; \min_{v_L, v_R \in \mathcal{T}} \left\{ d_u(v_L, v_R, D(v_L), D(v_R)) \right\} \tag{12}$$

where $D(v_L)$ and $D(v_R)$ are the respective distortions of the left and right reference views and $d_u(v_L, v_R, D_L, D_R)$ is the 
distortion of the virtual view $u$ synthesized using left and right reference views $v_L$ and $v_R$ with distortions $D_L$ and $D_R$, 
respectively. In (12), for each virtual view $u$ the best reference pair in $\mathcal{T}$ is selected for synthesis. Note that, unlike [11], the 
best reference pair may not be the closest references, since the quality of synthesized $u$ depends not only on the view distance 
between the synthesized and reference views, but also on the distortions of the references. 
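
To make the selection problem concrete, the sketch below solves a tiny instance by exhaustive search over all subsets that respect the capacity. It is not the algorithm proposed in this paper: the distortion model (reference distance plus the distortions of the two references, and the reference's own distortion when a virtual view coincides with a selected reference) and all numbers are hypothetical assumptions. View positions are scaled by 10 to keep integer arithmetic.

```python
from itertools import combinations


def pair_distortion(v_left, v_right, d_left, d_right):
    # Hypothetical model: reference distance plus reference distortions.
    return (v_right - v_left) + d_left + d_right


def window_distortion(subset, window, ref_distortion):
    """Aggregate distortion over the window, or None if a view cannot be synthesized."""
    total = 0
    for u in window:
        if u in subset:                      # view is itself a reference
            total += ref_distortion[u]
            continue
        lefts = [v for v in subset if v < u]
        rights = [v for v in subset if v > u]
        if not lefts or not rights:
            return None
        total += min(pair_distortion(l, r, ref_distortion[l], ref_distortion[r])
                     for l in lefts for r in rights)
    return total


def select_references(candidates, window, ref_distortion, capacity):
    """Exhaustive search over all subsets of at most `capacity` views."""
    best_subset, best_total = None, None
    for size in range(1, capacity + 1):
        for subset in combinations(sorted(candidates), size):
            total = window_distortion(subset, window, ref_distortion)
            if total is not None and (best_total is None or total < best_total):
                best_subset, best_total = subset, total
    return best_subset, best_total


# Cameras at 1, 2, 3 (undistorted); views 1.8 and 2.2 can be synthesized
# in the cloudlet with a hypothetical distortion penalty of 1.
distortion = {10: 0, 20: 0, 30: 0, 18: 1, 22: 1}
window = [18, 19, 20, 21, 22]
with_synthesis = select_references([10, 20, 30, 18, 22], window, distortion, 2)
cameras_only = select_references([10, 20, 30], window, distortion, 2)
print(with_synthesis, cameras_only)
```

In this toy instance, allowing synthesized references lowers the aggregate distortion compared with selecting camera views only, mirroring the second user of the introductory example. The exhaustive search is exponential in the number of candidates and serves only as an illustration.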


B. Distortion of virtual viewpoints 

We consider first an assumption on the synthesized view distortion $d_u()$ called the shared optimality of reference views: 

$$\begin{aligned} \text{if} \quad & d_u(v_L, v_R, D(v_L), D(v_R)) \leq d_u(v'_L, v'_R, D(v'_L), D(v'_R)) \\ \text{then} \quad & d_{u'}(v_L, v_R, D(v_L), D(v_R)) \leq d_{u'}(v'_L, v'_R, D(v'_L), D(v'_R)) \end{aligned} \tag{13}$$

for $\max\{v_L, v'_L\} \leq u, u' \leq \min\{v_R, v'_R\}$. In words, assumption (13) states that if the virtual view $u$ is better synthesized 
using the reference pair $(v_L, v_R)$ than $(v'_L, v'_R)$, then another virtual view $u'$ is also better synthesized using $(v_L, v_R)$ than 
$(v'_L, v'_R)$. 

We see intuitively that this assumption is reasonable for smooth 3D scenes; a virtual view $u$ tends to be similar to its 
neighbor $u'$, so a good reference pair $(v_L, v_R)$ for $u$ should also be good for $u'$. We can also argue for the plausibility of this 
assumption as a consequence of two functional trends in the synthesized view distortion $d_u()$ that are observed empirically to 
be generally true. For simplicity, consider for now the case where the reference views $v_L, v_R, v'_L, v'_R$ have zero distortion, i.e., 
$D(v_L) = D(v_R) = D(v'_L) = D(v'_R) = 0$. The first trend is the monotonicity in predictor's distance [20]: the further away 
the reference views are from the target synthesized view, the worse the resulting synthesized view distortion. This trend has 
been successfully exploited for efficient bit allocation algorithms [20], [47]. In our scenario, this trend implies that reference 
pair $(v_L, v_R)$ is better than $(v'_L, v'_R)$ at synthesizing view $u$ because the pair is closer to $u$, i.e., 

$$|u - v_L| + |v_R - u| \leq |u - v'_L| + |v'_R - u| \tag{14}$$

where $\max\{v_L, v'_L\} \leq u \leq \min\{v_R, v'_R\}$. 

It is easy to see that if reference pair $(v_L, v_R)$ is closer to $u$ than $(v'_L, v'_R)$, it is also closer to $u'$, thus better at synthesizing 
$u'$. Without loss of generality, we write the new virtual view $u'$ as $u' = u + \delta$. We can then write: 

$$\begin{aligned} |(u + \delta) - v_L| + |v_R - (u + \delta)| &= u - v_L + v_R - u \\ &\leq u - v'_L + v'_R - u \\ &\leq |(u + \delta) - v'_L| + |v'_R - (u + \delta)| \end{aligned} \tag{15}$$

where $\max\{v_L, v'_L\} \leq u' \leq \min\{v_R, v'_R\}$. 
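
The argument above can be checked with a few lines of code. The sketch assumes, for illustration only, that the synthesized view distortion is exactly the total distance to the two undistorted references; under that assumption the preference between two reference pairs is the same for every virtual view in their common range. Function names and view positions are hypothetical.

```python
def distance_cost(u, v_left, v_right):
    # Hypothetical distortion: total distance to the two references.
    return abs(u - v_left) + abs(v_right - u)


def shared_optimality_holds(pair_a, pair_b, views):
    """True if the preference between the two pairs is the same for all common views."""
    lo = max(pair_a[0], pair_b[0])
    hi = min(pair_a[1], pair_b[1])
    common = [u for u in views if lo <= u <= hi]
    preferences = {distance_cost(u, *pair_a) <= distance_cost(u, *pair_b)
                   for u in common}
    return len(preferences) <= 1


views = [2, 2.5, 3, 3.5, 4]
print(shared_optimality_holds((2, 3), (1, 4), views))
print(shared_optimality_holds((1, 4), (2, 5), views))
```

Because the distance-only cost of a pair equals the gap between its two references for any view lying between them, the ordering of two pairs cannot change from one virtual view to another.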

Consider now the case where the reference views $v_L, v_R, v'_L, v'_R$ have non-zero distortions. In [48], another functional trend is 
empirically demonstrated, where a reference view $v_L$ with distortion $D(v_L)$ was well approximated as a further-away equivalent 
reference view $v_L^{\#} < v_L$ with no distortion, $D(v_L^{\#}) = 0$. Thus a better reference pair $(v_L, v_R)$ than $(v'_L, v'_R)$ at synthesizing $u$ 



Fig. 2. Reference view assignment in (a) contradicts the shared reference assumption. Reference view assignment in (b) respects the shared reference 
assumption but contradicts the independence of reference optimality assumption. 


just means that the equivalent reference pair for $(v_L, v_R)$ is closer to $u$ than the equivalent reference pair for $(v'_L, v'_R)$. Using 
the same previous argument, we see that the equivalent reference pair for $(v_L, v_R)$ is also closer to $u'$ than that for $(v'_L, v'_R)$, resulting 
in a smaller synthesized distortion. Hence, we can conclude that the assumption of shared optimality of reference views is a 
consequence of these two functional trends. 

We can graphically illustrate possible solutions to the optimization problem (11) under the assumption of shared optimality 
of reference views. Fig. 2(a) depicts the selected reference views for virtual views in the navigation window. In the figure, 
the x-axis represents the virtual views in the window $[U_L^0, U_R^0]$ that require synthesis. Correspondingly, on the y-axis are two 
piecewise constant (PWC) functions representing the left and right reference views selected for synthesis of each virtual view $u$ 
in the window, assuming that for each $u \in [U_L^0, U_R^0]$ there must be one selected reference pair $(v_L, v_R)$ such that $v_L \leq u \leq v_R$. 
A constant line segment (e.g., $v = v_1$ for virtual views $u < v_3$ in Fig. 2(a)) means that the same reference is used for a range of 
virtual views. This graphical representation results in two PWC functions, the left and right reference views, above and below 
the $u = v$ line. The set of selected reference views is the union of the constant step locations in the two PWC functions. 

Under the assumption of shared reference optimality, we see that the selected reference views in Fig. 2(a) cannot be an 
optimal solution. Specifically, virtual views $v_3 - 1/L$ and $v_3$ employ references $[v_1, v_4]$ and $[v_2, v_5]$, respectively. However, if 
references $[v_1, v_4]$ are better than $[v_2, v_5]$ for virtual view $v_3 - 1/L$, they should also be better for virtual view $v_3$ according 
to shared reference optimality in (13). An example of an optimal solution candidate under the assumption of shared reference 
optimality is shown in Fig. 2(b). 

C. NP-hard Proof 

We now outline a proof by construction that shows that the reference view selection problem (11) is NP-hard under the shared 
optimality assumption. We show it by reducing the known NP-hard set cover (SC) problem [49] to a special case of the 
reference view selection problem. In SC, a set of items $\mathcal{S}$ (called the universe) is given, together with a defined collection 
$\mathcal{C}$ of subsets of items in $\mathcal{S}$. The SC problem is to identify at most $K$ subsets from collection $\mathcal{C}$ that cover $\mathcal{S}$, i.e., a smaller 
collection $\mathcal{C}' \subseteq \mathcal{C}$ with $|\mathcal{C}'| \leq K$ such that every item in $\mathcal{S}$ belongs to at least one subset in collection $\mathcal{C}'$. 

We construct a corresponding special case of our reference view selection problem as follows. For each item $i$ in $\mathcal{S} = 
\{1, \ldots, |\mathcal{S}|\}$ in the SC problem, we first construct an undistorted reference view $i$. In addition, we construct a default undistorted 
right reference view $|\mathcal{S}| + 1$, and the navigation window is set to $[U_L^0, U_R^0] = [1, |\mathcal{S}| + 1]$ and $L = 2$. Further, for each item $i$ 
in $\mathcal{S}$, we construct a virtual view $i + \frac{1}{2}$ that requires the selection of left reference $i$, in combination with default right reference 
$|\mathcal{S}| + 1$, for the resulting synthesized view distortion $d_{i+\frac{1}{2}}(i, |\mathcal{S}| + 1, 0, 0)$ to achieve distortion $\bar{D} < \infty$. Thus the selection 
of $|\mathcal{S}|$ left references and one default right reference $|\mathcal{S}| + 1$ already consumes $|\mathcal{S}| + 1$ views worth of bandwidth. See Fig. 3 
for an illustration. Note that given this selection of left reference views, any selection of right reference views will satisfy the 
shared optimality of reference views assumption. 

For each subset $j$ in collection $\mathcal{C} = \{1, \ldots, |\mathcal{C}|\}$ in the SC problem, we construct a right reference view $|\mathcal{S}| + 1 + j$, such 
that if item $i$ belongs to subset $j$ in the SC problem, the synthesized distortion $d_{i+\frac{1}{2}}(i, |\mathcal{S}| + 1 + j, 0, 0)$ at virtual view $i + \frac{1}{2}$ 
will be reduced to $\bar{D} - \Delta$ given that right reference view $|\mathcal{S}| + 1 + j$ is used. The corresponding binary decision we ask is: given 
a channel bandwidth of $|\mathcal{S}| + 1 + K$, is there a reference view selection such that the resulting synthesized view distortion is 
$|\mathcal{S}|(\bar{D} - \Delta)$ or less? 

From the construction, it is clear that to minimize overall distortion, left reference views $1, \ldots, |\mathcal{S}|$ and default right reference 
view $|\mathcal{S}| + 1$ must be first selected in any solution with distortion $< \infty$. Given the remaining budget of $K$ additional views, if 

Fig. 3. Example of items set $\mathcal{S}$ and collection of sets $\mathcal{C}$, with $|\mathcal{S}| = 5$ and $|\mathcal{C}| = 4$. 

a distortion of $|\mathcal{S}|(\bar{D} - \Delta)$ is achieved, that means $K$ or fewer additional right reference views are selected to reduce the synthesized 
distortion from $\bar{D}$ to $\bar{D} - \Delta$ at each of the virtual views $i + \frac{1}{2}$, $i \in \{1, \ldots, |\mathcal{S}|\}$. Thus these $K$ or fewer additionally selected 
right reference views correspond exactly to the subsets in the SC problem that cover all items in the set $\mathcal{S}$. Thus, solving this 
special case of the reference view selection problem is no easier than solving the SC problem, and therefore the reference 
view selection problem is also NP-hard. □ 
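
The correspondence used in the proof can be illustrated on a small instance. In the sketch below, each subset of the set cover instance plays the role of an additional right reference view, and an item's virtual view reaches the reduced distortion only if at least one selected subset contains that item. The instance, the distortion values, and the function names are hypothetical, and the search is brute force.

```python
from itertools import combinations


def min_distortion(universe, subsets, budget, base, reduction):
    """Lowest total distortion with at most `budget` extra right references."""
    best = len(universe) * base
    for size in range(budget + 1):
        for chosen in combinations(range(len(subsets)), size):
            covered = set().union(*(subsets[j] for j in chosen))
            total = sum(base - reduction if item in covered else base
                        for item in universe)
            best = min(best, total)
    return best


def has_cover(universe, subsets, budget):
    """True if at most `budget` subsets cover the whole universe."""
    return any(set().union(*(subsets[j] for j in chosen)) >= set(universe)
               for size in range(budget + 1)
               for chosen in combinations(range(len(subsets)), size))


# Hypothetical instance with five items and four subsets.
universe = {1, 2, 3, 4, 5}
subsets = [{1, 2}, {2, 3, 4}, {4, 5}, {1, 3, 5}]
target = len(universe) * (10 - 3)
for budget in (1, 2):
    reached = min_distortion(universe, subsets, budget, 10, 3) <= target
    print(budget, reached, has_cover(universe, subsets, budget))
```

For every budget, the target distortion is reached exactly when a set cover of that size exists, which is the equivalence the reduction relies on.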

V. Optimal View Selection Algorithm 

Given that the reference view selection problem (11) is NP-hard under the assumption of shared optimality of reference 
views, in this section we introduce another assumption on the synthesized view distortion that holds in most common 3D 
scenes. Given these two assumptions, we show that (11) can now be solved optimally in polynomial time by a DP algorithm. 
We also analyze the DP algorithm’s computational complexity. 

A. Independence of reference optimality assumption 

The second assumption on the synthesized view distortion $d_u()$ is the independence of reference optimality, stated formally 
as follows: 

$$\begin{aligned} \text{if} \quad & d_u(v_L, v_R, D(v_L), D(v_R)) \leq d_u(v'_L, v_R, D(v'_L), D(v_R)) \\ \text{then} \quad & d_u(v_L, v'_R, D(v_L), D(v'_R)) \leq d_u(v'_L, v'_R, D(v'_L), D(v'_R)) \end{aligned} \tag{16}$$

for $\max\{v_L, v'_L\} \leq u \leq \min\{v_R, v'_R\}$. In words, assumption (16) states that if $v_L$ is a better left reference than $v'_L$ when 
synthesizing virtual view $u$ using $v_R$ as right reference, then $v_L$ remains the better left reference to synthesize $u$ even if a 
different right reference $v'_R$ is used. This assumption essentially states that contributions towards the synthesized image from 
the two references are independent from each other, which is reasonable since each rendered pixel in the synthesized view is 
typically copied from one of the two references, but not both. We can also argue for the plausibility of this assumption as a 
consequence of the two aforementioned functional trends in the synthesized view distortion $d_u()$ in Section IV. Consider first 
the case where the reference views $v_L, v_R, v'_L, v'_R$ have zero distortion. The monotonicity in predictor’s distance in (14) for a 
common right reference view becomes 

$$|u - v_L| + |v_R - u| \leq |u - v'_L| + |v_R - u| \;\Longrightarrow\; |u - v_L| \leq |u - v'_L| \tag{17}$$

where $\max\{v_L, v'_L\} \leq u \leq v_R$. Thus if $v_L$ is preferred to $v'_L$ for $v_R \geq u$, this also holds for $v'_R$ as long as $v'_R \geq u$. Consider 
now the case where the reference views $v_L, v_R, v'_L, v'_R$ have non-zero distortions. Introducing the equivalent reference views 
$v_L^{\#} < v_L$ and $v_L'^{\#} < v'_L$ with no distortion, $D(v_L^{\#}) = D(v_L'^{\#}) = 0$, the same argument of (17) holds for the equivalent reference views, leading to 
$|u - v_L^{\#}| \leq |u - v_L'^{\#}|$. 

We illustrate different optimal solution candidates to (11) now under both virtual view distortion assumptions to impart 
intuition. We see that the assumption of independence of reference optimality would prevent the reference view selection in 

Fig. 4. Reference view assignments in (a) and (b) are optimal solution candidates under both assumptions. We name these two cases “shared-left” and 
“shared-right”, respectively. 


Fig. 2(b) from being an optimal solution. Specifically, we see that both $v_3$ and $v_4$ are feasible right reference views for virtual 
views $v_2 - 1/L$ and $v_2$. Regardless of which left references are selected for these two virtual views, if $v_3$ is a strictly better 
right reference than $v_4$, then having both virtual views select $v_3$ as right reference will result in a lower overall distortion 
(and vice versa). If $v_3$ and $v_4$ are equally good right reference views resulting in the same synthesized view distortion, then 
selecting just $v_4$ without $v_3$ can achieve the same distortion with one fewer right reference view. Thus the selected reference 
views in Fig. 2(b) cannot be optimal. 

We can thus make the following observation: as virtual view $u$ increases, an optimal solution cannot switch right reference 
view from the current $v_R$ earlier than $u = v_R$. Conversely, as virtual view $u$ decreases, an optimal solution cannot switch left 
reference view from the current $v_L$ earlier than $u = v_L - 1/L$. Fig. 4 provides example solutions of left and right reference 
views for virtual views in the navigation window. In the figure, on the x-axis are the virtual views $u$ in the window $[U_L^0, U_R^0]$ 
that require synthesis. Correspondingly, on the y-axis are the left and right reference views (blue and red PWC functions, 
respectively) selected to synthesize each virtual view $u$ in the window. We see that the reference view selections in Fig. 4(a) 
and Fig. 4(b) are optimal solution candidates to (11). Thus, the optimal reference view selections must be graphically composed 
of “staircase” virtual view ranges as shown in Fig. 4(a) and Fig. 4(b). In other words, either a shared left reference view 
is used for multiple virtual view ranges $[u_i, u_{i+1})$, where each range has the same left reference (“shared-left” case), 
or a shared right reference view is used for multiple ranges $[u_i, u_{i+1})$, where each range has the same right reference 
(“shared-right” case). This motivates us to design an efficient DP algorithm to solve (11) optimally in polynomial time. 

B. DP Algorithm 

We first define a recursive function $\Phi(u_L, v_L, k)$ as the minimum aggregate synthesized view distortion of views between $u_L$ 
and $U_R^0$, given that $v_L$ is the selected left reference view for synthesizing view $u_L$ and that there is a budget of $k$ additional reference 
views. To analyse $\Phi(u_L, v_L, k)$, we consider the two “staircase” cases identified by Fig. 4(a) and Fig. 4(b) separately, and show 
how $\Phi(u_L, v_L, k)$ can be evaluated in each of the cases. 

Consider first the “shared-left” case (Fig. 4(a)), where a shared left reference view is employed in a sequence of virtual view 
ranges. A view range represents a contiguous range of virtual viewpoints that employ the same left and right reference views. 
The algorithm selects a new right reference view $v$, $v > u_L$, creating a new range of virtual views $[u_L, v)$. Virtual views in 
range $[u_L, v)$ are synthesized using the shared left reference $v_L$ and the newly selected reference view $v$, resulting in distortion 
$d_u(v_L, v, D(v_L), D(v))$ for each virtual view $u$, $u_L \le u < v$. The aggregate distortion function $\Phi(u_L, v_L, k)$ for this case 
is the distortion of views in $[u_L, v)$ plus a recursive term $\Phi(v, \Lambda(v_L, v), k - 1)$ to account for aggregate synthesized view 
distortions to the right of $v$: 

$$\sum_{u \in [u_L, v)} d_u(v_L, v, D(v_L), D(v)) + \Phi(v, \Lambda(v_L, v), k - 1) \tag{18}$$

where $k - 1$ is the remaining budget of additional reference views, and $\Lambda(v_1, v_2)$ chooses the better of the two left reference 
views, $v_1$ and $v_2$, for the recursive function $\Phi(\cdot)$. In particular, using any right reference view $v_R$ and virtual view $u$, where 
$\max\{v_1, v_2\} < u < v_R$, we set $\Lambda(v_1, v_2) = v_1$ if virtual view $u$ is better synthesized using $v_1$ as left reference than $v_2$ (and 
set $\Lambda(v_1, v_2) = v_2$ otherwise). Formally, the left reference selection function $\Lambda(v_1, v_2)$ is defined as: 

$$\Lambda(v_1, v_2) = \begin{cases} v_1 & \text{if } d_u(v_1, v_R, D(v_1), D(v_R)) < d_u(v_2, v_R, D(v_2), D(v_R)) \\ v_2 & \text{o.w.} \end{cases} \tag{19}$$

Given our two assumptions, we know that the selected left reference $\Lambda(v_1, v_2)$ remains better for all other virtual views $u$ in 
$[\max\{v_1, v_2\}, v_R]$. 

We now consider the “shared-right” case (Fig. 4(b)), where a newly selected view $v$ is actually a common right reference 
view for a sequence of virtual view ranges from $u_L$ to $v$. We first define a companion recursive function $\Psi(u_L, v_L, v_R, n)$ that 
returns the minimum aggregate synthesized view distortion from view $u_L$ to $v_R$, given that $v_L$ is the selected left reference 
view, $v_R$ is the common right reference view, and there is a budget of $n$ other left reference views in addition to $v_L$. We can 
write $\Psi(u_L, v_L, v_R, n)$ recursively as follows: 

$$\Psi(u_L, v_L, v_R, n) = \begin{cases} \min_{v} \sum_{u \in [u_L, v)} d_u(v_L, v_R, D(v_L), D(v_R)) + \Psi(v, v, v_R, n - 1) & \text{if } n \ge 1 \\ \sum_{u = u_L}^{v_R} d_u(v_L, v_R, D(v_L), D(v_R)) & \text{o.w.} \end{cases} \tag{20}$$

In more detail, equation (20) states that $\Psi(u_L, v_L, v_R, n)$ is the synthesized view distortion of views in the range $[u_L, v)$, 
plus the recursive distortion $\Psi(v, v, v_R, n - 1)$ from view $v$ to $v_R$ with a reduced reference view budget $n - 1$. 

We can now put the two cases together into a complete definition of $\Phi(u_L, v_L, k)$ as follows: 

$$\Phi(u_L, v_L, k) = \min_{v > v_L} \begin{cases} \sum_{u \in [u_L, v)} d_u(v_L, v, D(v_L), D(v)) + \Phi(v, \Lambda(v_L, v), k - 1), & \text{“shared-left” case} \\ \min_{1 \le n \le k - 1} \Psi(u_L, v_L, v, n) + \Phi(v, v, k - n - 1), & \text{“shared-right” case} \end{cases} \tag{21}$$

The relation (21) states that $\Phi(u_L, v_L, k)$ examines each candidate reference view $v$, $v > v_L$, which can be used either as right 
reference for synthesizing virtual views in $[u_L, v)$ with left reference $v_L$ (“shared-left” case), or as a common right reference 
for a sequence of $n + 1$ virtual view ranges within the interval $[u_L, v)$ (“shared-right” case). 

When the remaining view budget is $k = 1$, the relation $\Phi(u_L, v_L, 1)$ in (21) simply selects a right reference view $v$, $v \ge U_R^0$, 
which minimizes the aggregate synthesized view distortion for the range $[u_L, U_R^0]$: 

$$\Phi(u_L, v_L, 1) = \min_{v \ge U_R^0} \sum_{u = u_L}^{U_R^0} d_u(v_L, v, D(v_L), D(v)) \tag{22}$$


Having defined $\Phi(u_L, v_L, k)$, we can identify the best $K$ reference views by calling $\Phi(U_L^0, v, K - 1)$ repeatedly to identify the 
best leftmost reference view $v$, $v \le U_L^0$, and start the selection of the $K - 1$ remaining reference views as follows: 

$$\min_{v \le U_L^0} \Phi(U_L^0, v, K - 1) \tag{23}$$
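The recursion in (18)-(23) can be sketched in code. The following is a minimal memoized sketch, not the authors' implementation, under several assumptions of ours: views live on an integer grid (positions already multiplied by the sampling factor), `dist(u, v_l, v_r)` is a hypothetical user-supplied callback returning $d_u$, the function name `select_views_dp` is ours, and $\Lambda$ is probed with the rightmost candidate as right reference (the paper's assumptions make the choice of probe irrelevant). The sketch also lets a branch use fewer references than its budget, and it closes the window explicitly when the chosen right reference reaches $U_R^0$.

```python
from functools import lru_cache

def select_views_dp(candidates, u_lo, u_hi, K, dist):
    """Minimum aggregate distortion over views u_lo..u_hi using at most K references."""
    INF = float("inf")
    cands = sorted(candidates)

    def span(a, b, v_l, v_r):
        # Aggregate distortion of views a, ..., b-1 synthesized from (v_l, v_r).
        return sum(dist(u, v_l, v_r) for u in range(a, b))

    def better_left(v1, v2):
        # Lambda(v1, v2) as in (19), probed with the rightmost candidate (assumption).
        v_r = cands[-1]
        u = min(max(v1, v2) + 1, v_r)
        return v1 if dist(u, v1, v_r) < dist(u, v2, v_r) else v2

    @lru_cache(maxsize=None)
    def psi(u_l, v_l, v_r, n):
        # "Shared-right": views from u_l up to v_r (clipped to the window) share
        # right reference v_r; up to n further left references may be inserted.
        end = v_r if v_r < u_hi else u_hi + 1
        best = span(u_l, end, v_l, v_r)
        if n >= 1:
            for v in cands:
                if u_l < v < v_r and v <= u_hi:
                    best = min(best, span(u_l, v, v_l, v_r) + psi(v, v, v_r, n - 1))
        return best

    @lru_cache(maxsize=None)
    def phi(u_l, v_l, k):
        # Views u_l..u_hi, left reference v_l, budget of k >= 1 further references.
        best = INF
        for v in cands:
            if v <= u_l:
                continue
            if v >= u_hi:
                # v closes the window, as in (22), possibly as a shared right reference.
                best = min(best, psi(u_l, v_l, v, k - 1))
                continue
            if k == 1:
                continue  # the last reference must cover the right end of the window
            # "Shared-left" case, as in (18).
            best = min(best, span(u_l, v, v_l, v) + phi(v, better_left(v_l, v), k - 1))
            # "Shared-right" case: v is common right reference of n + 1 ranges.
            for n in range(1, k - 1):
                best = min(best, psi(u_l, v_l, v, n) + phi(v, v, k - n - 1))
        return best

    # As in (23): try every leftmost reference v <= u_lo.
    return min((phi(u_lo, v, K - 1) for v in cands if v <= u_lo), default=INF)
```

For instance, with the toy distortion `dist = lambda u, l, r: (u - l) * (r - u)`, candidates `[0, 2, 4, 6]`, window `[1, 5]` and `K = 3`, the sketch selects a middle reference at 2 or 4 in addition to 0 and 6.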


C. Computation Complexity 

Our proposed DP algorithm requires two different tables to be stored. The first time $\Psi(u_L, v_L, v_R, n)$ is computed, the result 
can be stored in entry $[(u_L - U_L^0)/L][(v_L - U_L^0)/L][(v_R - U_L^0)/L][n]$ of a DP table $\Psi^*$, so that subsequent calls with the 
same arguments can be simply looked up. Analogously, the first time $\Phi(u_L, v_L, k)$ is called, the computed value is stored in 
entry $[(u_L - U_L^0)/L][(v_L - U_L^0)/L][k]$ of another DP table $\Phi^*$ to avoid repeated computation in future recursive calls. 

We bound the computation complexity of our proposed algorithm (21) by computing a bound on the sizes of the required DP 
tables and the cost of computing each table entry. For notational convenience, let the number of reference views and synthesized 
views be $S_v = (V - 1)/L$ and $S_u = (U_R^0 - U_L^0)/L$, respectively. The size of DP table $\Phi^*$ is no larger than $S_u \times S_v \times K$. 

The cost of computing an entry in $\Phi^*$ using (21) over all possible reference views $v$ involves the computation of the “shared-left” 
case with complexity $O(S_u)$ and that of the “shared-right” case with complexity $O(K)$. Thus, each table entry has 
complexity $O(S_v S_u + S_v K)$. Hence the complexity of completing the DP table $\Phi^*$ is $O(S_u^2 S_v^2 K + S_u S_v^2 K^2)$. Given that in a 
typical setting $S_u \gg K$, the complexity of computing DP table $\Phi^*$ is thus $O(S_u^2 S_v^2 K)$. 

We can follow a similar procedure to estimate the complexity of computing DP table $\Psi^*$. The size of the table in this case 
is upper-bounded by $S_u \times S_v \times S_v \times K$. The complexity of computing each entry is $O(S_u)$. Thus the complexity of computing 
DP table $\Psi^*$ is $O(S_u^2 S_v^2 K)$, which is the same as for DP table $\Phi^*$. Thus the overall computation complexity of our solution 
is also $O(S_u^2 S_v^2 K)$. 
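Both bounds are simply the table size multiplied by the per-entry cost. Written out, with $\Phi^*$ and $\Psi^*$ denoting the two DP tables and $S_u$, $S_v$, $K$ as defined above:

```latex
\begin{align*}
\Phi^*:\quad & \underbrace{S_u S_v K}_{\text{entries}} \cdot
               \underbrace{O(S_v S_u + S_v K)}_{\text{cost per entry}}
               = O\!\left(S_u^2 S_v^2 K + S_u S_v^2 K^2\right)
               \;\xrightarrow{\;S_u \gg K\;}\; O\!\left(S_u^2 S_v^2 K\right) \\
\Psi^*:\quad & \underbrace{S_u S_v^2 K}_{\text{entries}} \cdot
               \underbrace{O(S_u)}_{\text{cost per entry}}
               = O\!\left(S_u^2 S_v^2 K\right)
\end{align*}
```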
TABLE I 

Viewpoints notation. 

Camera ID as in [50], “Statue”  | 50 | 51    | 52   | 53    | 54  | 55    | 56   | 57    | 58 | 59    | ... | 98 
Camera ID as in [50], “Mansion” | 25 | 26    | 27   | 28    | 29  | 30    | 31   | 32    | 33 | 34    | ... | 73 
Camera ID in our work           | 0  | 0.125 | 0.25 | 0.375 | 0.5 | 0.625 | 0.75 | 0.875 | 1  | 1.125 | ... | 6 
VI. Simulation Results 

A. Settings 

We study the performance of our algorithm and show the distortion gains offered by cloudlet-based virtual view 
synthesis. For a given navigation window $[U_L^0, U_R^0]$, we provide the average quality at which viewpoints in the navigation 
window are synthesized. That is, we evaluate the average distortion of the navigation window as $(1/N) \sum_{u = U_L^0}^{U_R^0} d_u$, 
with $N$ being the number of synthesized viewpoints in the navigation window, and we then compute the corresponding PSNR. 
In our algorithm, we consider the following model for the distortion of the synthesized viewpoint $u$ from reference 
views $V_L$, $V_R$: 

$$d_u(V_L, V_R, D_L, D_R) = \alpha D_{min} + (1 - \alpha) \beta D_{max} + [1 - \alpha - (1 - \alpha) \beta] D_I \tag{24}$$

where $D_{min} = \min\{D_L, D_R\}$, $D_{max} = \max\{D_L, D_R\}$, $D_I$ is the inpainted distortion, $\alpha = \exp(-\gamma |u - V_{min}| d)$, and 
$\beta = \exp(-\gamma |u - V_{max}| d)$, where $d$ is the distance between two consecutive camera views $v_i$ and $v_{i+1}$, $V_{min} = V_L$ if $D_L \le D_R$, 
$V_{min} = V_R$ otherwise, and $V_{max} = V_L$ if $D_L > D_R$, $V_{max} = V_R$ otherwise. The model can be explained as follows. A virtual view 
$u$, when reconstructed from $(V_L, V_R)$, has a relative portion $\alpha \in [0, 1]$ that is reconstructed at a distortion $D_{min}$ from the 
dominant reference view, defined as the one with minimum distortion. The remaining portion of the image, i.e., $1 - \alpha$, is either 
reconstructed by the non-dominant reference view for a portion $\beta$, at a distortion $D_{max}$, or it is inpainted, at a distortion $D_I$. 
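To make the model in (24) concrete, the following minimal sketch evaluates it for one virtual view and converts an average distortion into PSNR. The function and parameter names are ours. The PSNR conversion assumes that the distortion is a mean squared error on an 8-bit scale (peak value 255), which is an assumption of this sketch rather than something stated in the text.

```python
import math

def synthesis_distortion(u, v_left, v_right, d_left, d_right, gamma, d, d_inpaint):
    """Distortion of virtual view u synthesized from (v_left, v_right), following (24)."""
    # The dominant reference is the one with minimum distortion.
    if d_left <= d_right:
        v_min, d_min, v_max, d_max = v_left, d_left, v_right, d_right
    else:
        v_min, d_min, v_max, d_max = v_right, d_right, v_left, d_left
    alpha = math.exp(-gamma * abs(u - v_min) * d)  # portion taken from the dominant view
    beta = math.exp(-gamma * abs(u - v_max) * d)   # share of the rest taken from the other view
    inpainted = 1.0 - alpha - (1.0 - alpha) * beta  # portion that must be inpainted
    return alpha * d_min + (1.0 - alpha) * beta * d_max + inpainted * d_inpaint

def psnr_from_distortions(distortions, peak=255.0):
    """Average the per-view distortions and convert to PSNR (assumes MSE, 8-bit peak)."""
    mse = sum(distortions) / len(distortions)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

The three weights `alpha`, `(1 - alpha) * beta` and `inpainted` sum to one, so the modeled distortion is a convex combination of $D_{min}$, $D_{max}$ and $D_I$.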

The results have been obtained using the 3D sequences “Statue” and “Mansion” [50], where 51 cameras acquire the scene 
with uniform spacing between the camera positions. The spacing between camera positions is 5.33 mm and 10 mm for “Statue” 
and “Mansion”, respectively. Among all camera views provided for both sequences, only a subset represents the set of camera 
views $\mathcal{V}$ available at the cloudlet, while the remaining ones are virtual views to be synthesized. Table I depicts how the camera 
notation used in [50] is adapted to our notation. Finally, for the “Mansion” sequence, in the theoretical model in (24) we used 
$\beta = 0.2$, $D_{max} = 450$, and $d = 50$, while for the “Statue” sequence we used $\beta = 0.2$, $D_{max} = 100$, and $d = 25$. 

In the following, we compare the performance achieved by virtual view synthesis in the cloudlets with the scenario 
in which cloudlets only send a subset of camera views to users. We denote by $\mathcal{T}_s$ the subset of selected reference views when 
synthesis is allowed in the network, and by $\mathcal{T}_{ns}$ the subset of selected reference views when only camera views can be sent as 
reference views, i.e., when synthesis is not allowed in the network. For both the cases of network synthesis and no network 
synthesis, the best subset of reference views is evaluated both with the proposed view selection algorithm and with an exact 
solution, i.e., an exhaustive search over all possible combinations of reference views. For the proposed algorithm, the distortion 
is evaluated both with experimental computation of the distortion, where the results are labeled “Proposed Alg. (Experimental 
Dist)”, and with the model in (24), where the results are labeled “Proposed Alg. (Theoretical Dist)”. For all three algorithms, once the optimal 
subset of reference views is selected, the full navigation window is reconstructed experimentally and the mean PSNR of the 
actual reconstructed sequence is computed. 
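The exhaustive-search baseline can be sketched as follows. This is an illustrative sketch under our own assumptions, not the code used for the experiments: views are indexed on an integer grid, `dist(u, v_l, v_r)` is a hypothetical callback returning the synthesized distortion of view `u` from references `(v_l, v_r)`, each view is synthesized from the best left/right pair in the subset, and subsets smaller than the budget `K` are also allowed.

```python
from itertools import combinations

def exhaustive_search(candidates, u_lo, u_hi, K, dist):
    """Try every subset of at most K candidate references covering [u_lo, u_hi].

    Returns (best aggregate distortion, best subset as a sorted tuple).
    """
    best_cost, best_set = float("inf"), None
    ordered = sorted(candidates)
    for size in range(2, K + 1):
        for subset in combinations(ordered, size):
            # The subset must offer a left reference for u_lo and a right one for u_hi.
            if subset[0] > u_lo or subset[-1] < u_hi:
                continue
            cost = 0
            for u in range(u_lo, u_hi + 1):
                # Each view picks its best (left, right) pair among the selected references.
                cost += min(dist(u, l, r)
                            for l in subset if l <= u
                            for r in subset if r >= u)
            if cost < best_cost:
                best_cost, best_set = cost, subset
    return best_cost, best_set
```

The number of subsets grows combinatorially with the number of candidates, which is why such a search is only practical as a reference point on small instances.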

In the following, we first validate the distortion model in (24) as well as the proposed optimization algorithm. Then, we 
provide simulations using the model in (24) and study the gain offered by network synthesis. For the sake of clarity in the 
notation, in the following we identify the viewpoints by their indexes only. This means that the set of camera views with indexes 0, 1, and 3, 
for example, is denoted in the following by {0, 1, 3}. Analogously, the navigation window spanning viewpoints 0.75 to 5.25 is denoted in the 
following by [0.75, 5.25]. 

B. Performance of the view selection algorithm 

In Fig. 5 we provide the mean PSNR as a function of the available bandwidth $C$ in the setting of a regularly spaced camera 
set $\mathcal{V}$ = {0, 1, 2, ..., 5, 6} and a navigation window [0.75, 5.25] requested by the user. Results are provided for the “Mansion” 
and the “Statue” sequences in Fig. 5(a) and Fig. 5(b), respectively. For the “Mansion” sequence, the proposed algorithm with 
experimental distortion perfectly matches the exhaustive search. The proposed algorithm based on theoretical distortion also 
closely matches the exhaustive search method, with the exception of the experimental point at $C = 4$ in the network synthesis 
case. In that experiment, the algorithm selects as best subset $\mathcal{T}_s$ = {0.75, 2, 4, 5.25} rather than $\mathcal{T}_s$ = {0.75, 2, 3, 5.25} 
selected by the exhaustive search. Beyond the good match between exhaustive search and proposed algorithm, Fig. 5(a) also 
shows the gain achieved by synthesizing reference views at the cloudlets. For $C = 2$, the optimal sets of reference views are 
$\mathcal{T}_s$ = {0.75, 5.25} and $\mathcal{T}_{ns}$ = {0, 6}. The possibility of selecting the view at position 0.75 as reference view reduces the 
reference view distance for viewpoints in [0.75, 5.25] compared to the case in which camera view 0 is selected. Thus, as long 
Fig. 5. Validation of the proposed optimization model with equally spaced camera set $\mathcal{V}$ = {0, 1, 2, ..., 5, 6} and a navigation window [0.75, 5.25] for 
“Mansion” and “Statue” sequences. 



Fig. 6. Validation of the proposed optimization model for the “Statue” sequence with unequally spaced cameras $\mathcal{V}$ = {0, 1.5, 2, 2.75, 4, 5, 6} and a navigation 
window [0.75, 5.25]. 


as the viewpoint 0.75 is synthesized at a good quality in the network, synthesizing in the network improves the quality of the 
reconstructed region of interest when the bandwidth $C$ is limited. Increasing the channel capacity reduces the quality gain 
between synthesis and no synthesis at the cloudlets. For $C = 4$, for example, the virtual viewpoint 0.75 is used to reconstruct 
the view range [0.75, 2) of the navigation window. Thus, the benefit of selecting 0.75 rather than 0 is limited to a portion of 
the navigation window, and this portion usually decreases for large $C$. Similar considerations can be derived from Fig. 5(b) 
for the “Statue” sequence. We observe a very good match between the proposed algorithm and the exhaustive search. 

We then compare in Fig. 6 the performance of the exhaustive search algorithm with our optimization method in the case of 
non-equally spaced cameras. The “Statue” sequence is considered with the unequally spaced camera set $\mathcal{V}$ = {0, 1.5, 2, 2.75, 4, 5, 6} 
and a navigation window [0.75, 5.25] at the client. Similarly to the equally spaced scenario, the performance of the proposed 
optimization algorithm matches that of the exhaustive search. This confirms the validity of our assumptions and the 
optimality of the DP optimization solution. Also in this case, a quality gain is offered by virtual view synthesis in the network, 
with a maximum gain achieved for $C = 2$, with optimal reference views $\mathcal{T}_s$ = {0.75, 5.25} and $\mathcal{T}_{ns}$ = {0, 6}. 
Fig. 7. PSNR (in dB) as a function of the channel capacity $C$ for a regularly spaced camera set with varying distance 
among cameras, $\gamma = 0.3$, $D_I = 300$, navigation window [0.75, 5.25], and camera set $\mathcal{V}$ = {0, 1, 2, ..., 5, 6} (equally spaced cameras). 


TABLE II 

Optimal subsets for the scenario of Fig. 7. 

C | T_s                  | T_ns 
2 | {0.75, 5.25}         | {0, 6} 
3 | {0.75, 3, 5.25}      | {0, 3, 6} 
4 | {0.75, 2, 4, 5.25}   | {0, 2, 4, 6} 
5 | {0.75, 2, 3, 4, 5.25} | {0, 2, 3, 4, 6} 
6 | {0, 1, 2, 3, 4, 5.25} | {0, 1, 2, 3, 4, 6} 
7 | {0, 1, 2, 3, 4, 5, 6} | {0, 1, 2, 3, 4, 5, 6} 


C. Network synthesis gain 

We now study the performance gain due to synthesis in the network for different scenarios. However, multiview 
video sequences (with both texture and depth maps) currently available as test sequences have a very limited number of views 
(e.g., 8 views in the Ballet video sequence¹). Because of the lack of test sequences, we consider synthetic scenarios and we 
adopt the distortion model in (24) both for solving the optimization problem and for evaluating the system performance. The 
following results are meaningful since we already validated our synthetic distortion model in the previous subsection. 

We consider the cases of equally spaced cameras ($\mathcal{V}$ = {0, 1, 2, ..., 5, 6}) and unequally spaced cameras ($\mathcal{V}$ = {0, 1, 3, 5, 7, 8} 
and $\mathcal{V}$ = {0, 2, 3, 4, 7, 8}) capturing the scene of interest. In Fig. 7 we show the mean PSNR as a function of the available 
channel capacity $C$ when the navigation window requested by the user is [0.75, 5.25] and cameras are equally spaced. The 
distortion of the synthesized viewpoints is evaluated with (24), with $\gamma = 0.2$, $D_I = 200$, and $d = 25$. The case of synthesis 
in the network is compared with the one in which only camera views can be sent to clients. In Table II we show the optimal 
subsets $\mathcal{T}_s$ and $\mathcal{T}_{ns}$ associated with each simulation point in Fig. 7, where camera view indexes are highlighted in bold. We 
observe that the case with synthesis in the network performs best in terms of quality over the navigation window. When $C = 2$, 
$\mathcal{T}_s$ = {0.75, 5.25} for the network synthesis case, and $\mathcal{T}_{ns}$ = {0, 6} otherwise. However, the larger the channel capacity, the smaller 
the need for sending virtual viewpoints. When $C = 6$, for example, both camera views 0 and 1 can be sent, thus there is no 
gain in transmitting only view 0.75. Finally, when $C = 7$ and all camera views can be sent to clients, $\mathcal{T}_s = \mathcal{T}_{ns} = \mathcal{V}$, with $\mathcal{V}$ 
being the set of camera views. As expected, sending synthesized viewpoints as reference views leads to a quality gain only in 
constrained scenarios in which the channel capacity does not allow sending all views required for reconstructing the navigation 
window of interest. 

We now study the gain of allowing network synthesis when camera views are not equally spaced. In Table III we provide the 
optimal subsets of reference views for both sets of unequally spaced cameras ($\mathcal{V}$ = {0, 1, 3, 5, 7, 8} and $\mathcal{V}$ = {0, 2, 3, 4, 7, 8}). 
Similarly to the case of equally spaced cameras, we observe that virtual viewpoints are selected as reference views (i.e., they 
are in the best subset $\mathcal{T}_s$) when the bandwidth $C$ is limited. For camera set a), the virtual view 0.75 is selected as reference 
view also for $C = 4$, while camera set b) prefers to select the camera views 0, 2 at $C = 4$. This is justified by the fact that 

¹ http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/ 
TABLE III 

Selected subset of reference views and associated quality for scenarios with $[U_L^0, U_R^0]$ = [0.75, 7.25], $d$ = 25 mm, $\gamma = 0.2$, $D_{max} = 200$. 

Case a), V = {0, 1, 3, 5, 7, 8} 

C | T_s                 | PSNR  | T_ns               | PSNR 
2 | {0.75, 7.25}        | 29.39 | {0, 8}             | 28.04 
3 | {0.75, 3, 7.25}     | 32.35 | {0, 3, 8}          | 31.13 
4 | {0.75, 3, 5, 7.25}  | 35.24 | {0, 3, 5, 8}       | 33.87 
5 | {0, 1, 3, 5, 7.25}  | 35.85 | {0, 1, 3, 5, 8}    | 35.017 
6 | {0, 1, 3, 5, 7, 8}  | 36.56 | {MT                | 36.56 

Case b), V = {0, 2, 3, 4, 7, 8} 

C | T_s                 | PSNR  | T_ns               | PSNR 
2 | {0.75, 7.25}        | 29.08 | {0, 8}             | 28.04 
3 | {0.75, 4, 7.25}     | 32.33 | {0, 4, 8}          | 31.49 
4 | {0, 2, 4, 7.25}     | 34.18 | {0, 2, 4, 8}       | 33.21 
5 | {0, 2, 4, 7, 8}     | 34.92 | {0, 2, 4, 7, 8}    | 34.92 
6 | {0, 2, 3, 4, 7, 8}  | 35.60 | {0, 2, 4, 7, 8}    | 35.60 



Fig. 8. PSNR (in dB) vs. $U_L^0$ for a camera set $\mathcal{V}$ = {0, 2, 3, 4}, navigation window $[U_L^0, 4]$, with $d = 50$, $\gamma = 0.2$, and $D_I = 200$. 


in the latter scenario, the viewpoint 0.75 is synthesized from $(V_L, V_R) = (0, 2)$, thus at a larger distortion than the viewpoint 
0.75 in scenario a), where the viewpoint is synthesized from $(V_L, V_R) = (0, 1)$. This distortion penalty makes the synthesis 
worthwhile when the channel bandwidth is highly constrained ($C = 2, 3$), but not in the other cases. 

In Fig. 8 the average quality of the client navigation is provided as a function of the left extreme view $U_L^0$ of the navigation 
window $[U_L^0, 4]$ with the camera set $\mathcal{V}$ = {0, 2, 3, 4}, $d = 50$, $\gamma = 0.2$, and $D_I = 200$ in (24). It is worth noting that $U_L^0$ 
ranges from 0 to 1.875 and only view 0 is a camera view in this range. When $U_L^0 = 0$ and $C = 2$, the reference views 0 and 
4 perfectly cover the entire navigation window requested by the user, so there is no need for sending any virtual viewpoint as 
reference view. This is no longer true for $U_L^0 > 0$. When the channel capacity is $C = 2$, the gain of allowing synthesis at the 
cloudlets increases with $U_L^0$. This is justified by the fact that in a very challenging scenario (i.e., limited channel capacity), the 
larger $U_L^0$, the less efficient it is to send the reference view 0 to reconstruct images in $[U_L^0, 4]$. At the same time, sending 2 
and 4 as reference views would not allow reconstructing the viewpoints lower than 2. This gain of allowing network synthesis 
is reflected in the PSNR curves of Fig. 8, where we can observe an increasing gap between the cases of synthesis allowed and 
not allowed for $C = 2$. This gap is however reduced for the scenario of $C = 3$. This is expected since the navigation window 
is a limited one, at most ranging from 0 to 4, and 3 reference views cover the navigation window well. 

To better show this tradeoff between the distortion of the virtual reference view and the bandwidth gain, we introduce the 
thresholding channel value, defined as the value of channel bandwidth beyond which no gain 
is experienced in allowing synthesis in the network compared to the case of no synthesis. In Fig. 9 we provide the behavior 
of the thresholding channel value as a function of the navigation window, for different camera sets. In particular, we consider 
$U_L^0 = 0.5$ and we let $U_R^0$ vary from 5 to 10. Also, we simulate three different scenarios that differ in the available camera 
set. In particular, we have $\mathcal{V}$ = {0, 1, 2, 3, ...}, $\mathcal{V}$ = {0, 2, 3, 4, ...}, and $\mathcal{V}$ = {0, 3, 4, 5, ...}. The main difference is then 
in the reference views that can be used to synthesize the virtual viewpoint 0.5. In the first case, 0.5 is reconstructed from 
camera views (0, 1), while in the last case it is reconstructed from (0, 3), which increases the distortion of the synthesis. Because of this increased 
distortion of 0.5, the virtual viewpoint is not always sent as reference view. In particular, we can observe that the larger the 
distortion of the virtual viewpoint, the lower the thresholding channel value. This means that even in challenging scenarios, 
as for example in the case of $U_R^0 = 7$ and $C = 3$, if $\mathcal{V}$ = {0, 3, 4, 5, ...} then there is no gain in synthesizing in the network, 
Fig. 9. Thresholding $C$ vs. $U_R^0$ for a navigation window $[0.5, U_R^0]$, with $d$ = 50 mm, $\gamma = 0.2$, and $D_I = 200$. 



Fig. 10. PSNR gain (in dB) vs. the navigation window size, for different channel capacity constraints $C$ when $d$ = 50 mm, $\gamma = 0.2$, and $D_I = 200$. 


while we still have a gain if $\mathcal{V}$ = [...]. In Fig. 10 we provide the mean PSNR as a function of the size of the 
navigation window. In more detail, for each window size, we define the navigation window as starting at viewpoint $U_L^0$ and spanning that size. 
The starting viewpoint $U_L^0$ is randomly selected. For each realization of the navigation window, the best subset is evaluated 
(both when synthesis is allowed and when it is not) and the quality of the reconstructed viewpoints in the navigation window 
is evaluated. For each window size, we average the quality by simulating all possible starting viewpoints within a total range of [0, 12]. 
In the results we provide the PSNR gain, defined as the difference between the mean PSNR (in dB) when synthesis is 
allowed and the mean PSNR (in dB) when only camera views are considered as reference views. Thus, the figure shows the 
gain of synthesizing for different sizes of the navigation window. As a general trend, we observe that the quality gain decreases 
with the window size. This is due to the fact that the gain mainly comes from the lateral reference views, which are usually virtual viewpoints 
if synthesis is allowed. The resulting gain is therefore diluted for large sizes of the navigation window. Finally, we also 
observe that the gain does not necessarily depend on the channel constraint $C$. 

We now consider a scenario in which the camera view positions are not given a priori. In Fig. 11 we provide the mean 
PSNR as a function of the variance $\sigma^2$, which defines the randomness of the camera view positions when acquiring the 
scene. In more detail, we consider a navigation window $[U_L^0, U_R^0]$ = [2, 6]. We then define a deterministic camera view set 
$\mathcal{V}_d$ = {0, 1, 2, ..., 6, 7}, which is the best camera view set since it is aligned with the requested viewpoint navigation window. 
Fig. 11. PSNR (in dB) vs. $\sigma^2$ for different channel capacity constraints $C$ when $d = 50$, $\gamma = 0.2$, and $D_I = 200$. 



Fig. 12. PSNR (in dB) vs. the sampling distance $L$, for $C = 2$, $d = 50$, $\gamma = 0.2$, and $D_I = 200$. 


For each value of $\sigma^2$, we generate a random camera set $\mathcal{V}$ as $\mathcal{V} = \mathcal{V}_d + \mathbf{n}$, where each entry $n_i$ of the vector $\mathbf{n}$ is a Gaussian random 
variable with zero mean and variance $\sigma^2$, with $n_i$ and $n_j$ mutually independent for $i \ne j$. Thus, the larger $\sigma^2$, the larger 
the probability that the camera view set is not aligned with the navigation window. For each realization of $\mathcal{V}$, we run our 
optimization for both the cases of allowed and not allowed synthesis and we evaluate the experienced quality. For each 
value of $\sigma^2$ we simulate 400 runs and we provide in Fig. 11 the averaged quality. It is interesting to observe that even if 
camera views are not perfectly aligned with the navigation window of interest (i.e., even for large variance values), the quality 
degradation with respect to the case of $\sigma^2 = 0$ is limited, about 0.5 dB for $C = 3$, when network synthesis is allowed. On the 
contrary, when synthesis is not allowed in the cloudlet, the quality substantially decreases with $\sigma^2$, experiencing a PSNR loss 
of almost 1.5 dB. This means that network synthesis can compensate for cameras not ideally positioned in the 3D scene, as in 
the case of user-generated content systems. 
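The random camera placement used in this experiment can be sketched as below. This is a minimal illustration, assuming positions are plain floating-point indexes; the function name and the seeded generator are ours, and sorting the perturbed positions is an extra convenience not mentioned in the text.

```python
import random

def perturbed_camera_set(v_d, sigma2, rng):
    """Return V = V_d + n, with each n_i drawn independently from N(0, sigma2)."""
    sigma = sigma2 ** 0.5  # random.gauss expects the standard deviation
    return sorted(v + rng.gauss(0.0, sigma) for v in v_d)

# Example: one realization around the deterministic set V_d = {0, 1, ..., 7}.
rng = random.Random(0)  # fixed seed only to make the example reproducible
camera_set = perturbed_camera_set(range(8), 0.1, rng)
```

Averaging the resulting quality over many such realizations (400 per variance value in the text) gives one point of the curve.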

Finally, we study the performance of the cloudlet-based view synthesis for a varying number of acquiring cameras. In particular, 
given the set of equally spaced viewpoints $\mathcal{U}$, we assume that one in every $L$ viewpoints in $\mathcal{U}$ is a camera view, i.e., there are 
$L - 1$ virtual viewpoints between consecutive camera views. Since the viewpoints in $\mathcal{U}$ are equally spaced, say at distance $d$, $Ld$ is 
the distance between consecutive cameras. In the following, we provide the quality behavior for $L$ ranging from 1 to 12. For 
each value of the sampling distance $L$, we simulate a navigation window spanning a range of $20d$. The navigation window 
is selected uniformly at random and the optimization algorithm evaluates the best subset of reference views. The experienced 
quality is averaged over 400 runs and evaluated for different values of $L$. In Fig. 12 we show the mean navigation quality 
as a function of the sampling distance $L$, for the scenario with $C = 2$, $d = 50$, $\gamma = 0.2$, and $D_I = 200$ in (24). 
It is worth noting that for a user to navigate at a given quality, a much higher value of the sampling distance $L$ can be used when 
network synthesis is allowed than when it is not. For example, a mean navigation quality 
of 33 dB is achieved with $L = 5$ when network synthesis is not allowed, as opposed to $L = 10$ when allowing 
network synthesis. This means that when synthesis is allowed, half as many camera views are needed as in the 
case in which no synthesis is allowed. Thus, view synthesis in the network makes it possible to maintain a good navigation quality when 
reducing the number of cameras. 


VII. Conclusion 

When interactive multiview video systems face bandwidth constraints, we argue that synthesizing reference views in 
the cloud improves the quality of navigation at the client side. In particular, we propose a synthesized reference view selection 
optimization problem aimed at finding the best subset of viewpoints to be transmitted to the decoder as reference views. 
This subset is not limited to captured camera views as in previous approaches, but can also include virtual viewpoints. The 
problem is formalized as a combinatorial optimization problem, which is shown to be NP-hard. However, we show that, under 
the general assumption that the distortion of synthesized viewpoints is well-behaved, the problem can be solved in polynomial 
time via a dynamic programming algorithm. Simulation results validate the performance gain of the proposed method and show 
that synthesizing reference views can improve image quality at the client by up to 2.1 dB in PSNR. We finally demonstrate that 
view synthesis in the network compensates for non-optimal camera sampling and makes it possible to increase the distance between camera 
views without affecting the quality of the navigation. 